In this exercise, we will be using functions from the tidyverse package. You can see we’ve added the chunk option message = FALSE to hide the version information that tidyverse normally displays.

library(tidyverse)

(a) Experiment with figure sizing

Pick one of the plots you’ve made so far in exercise 1.3 or 2.3.

Try changing fig.width, fig.height and dpi in the code chunk options and see what happens.

Can you use these to make a plot with very small text? A plot with very large text?

icecore <- read_csv("icecore.csv")
ggplot(icecore, aes(x = air_age_before_2008, y = CO2_ppm, colour = core)) +
  geom_point()

ggplot(icecore, aes(x = air_age_before_2008, y = CO2_ppm, colour = core)) +
  geom_point()
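The two copies of the plot above differ only in their chunk options, which do not appear in the rendered output. As a rough sketch of the idea, the snippet below sets the same three options as defaults through knitr. The particular values are hypothetical, chosen only to exaggerate the effect, and the exact appearance will depend on your output format.

```r
library(knitr)

# Hypothetical values, not the ones used for the plots above.
# Fonts keep the same point size, so a physically small figure
# makes the text look large relative to the plotting area.
opts_chunk$set(fig.width = 3, fig.height = 2, dpi = 300)

# A physically large figure has the opposite effect (text looks small):
# opts_chunk$set(fig.width = 12, fig.height = 9, dpi = 72)
```

The same options can instead go in the header of an individual chunk, for example {r, fig.width = 3, fig.height = 2, dpi = 300}, which only affects that chunk.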

(b) Save a plot to a file

Copy and paste the code you wrote to make a plot for part (a), and then save the plot to a PNG file using ggsave().

ggplot(icecore, aes(x = air_age_before_2008, y = CO2_ppm, colour = core)) +
  geom_point()

ggsave("Ex_3_2_b.png", width = 4, height = 3, dpi = 600)
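By default, ggsave() saves the most recently displayed plot. A slightly more explicit variant, sketched below, stores the plot in an object and passes it through the plot argument. The mtcars data, the object name p and the temporary file path are stand-ins for illustration only, not part of the exercise.

```r
library(ggplot2)

# Stand-in plot using a built-in dataset (not the ice core data)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Save to a temporary location so nothing in the project is overwritten
out_file <- file.path(tempdir(), "example_plot.png")
ggsave(out_file, plot = p, width = 4, height = 3, dpi = 600)

# Width and height are in inches by default, so the pixel
# dimensions are inches multiplied by dpi
px <- c(width = 4, height = 3) * 600
px
```

With these settings the PNG should be 2400 by 1800 pixels, which is worth keeping in mind when choosing dpi for a file that will be emailed or embedded in a document.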

(c) Compare boxplots to means and confidence intervals

The file ipf_lifts_raw.csv contains the results from a large number of International Powerlifting Federation (IPF) meets. The data were sourced from Open Powerlifting, via Tidy Tuesday 2019-10-08. The file used here is a further subset of the Tidy Tuesday data, containing only “raw” powerlifting competitions (no equipment such as wraps or straps allowed) and only competitors whose age was known and under 80 years.

Powerlifting competitions are judged on each lifter’s “total”, which is the sum of the weight lifted on three lifts: the squat, the bench press and the deadlift.

In this exercise, we will investigate the relationship between powerlifting total (in variable total_lifted_kg) and age (in variable age_class) for each gender (in variable sex).

ipf_lifts_raw <- read_csv("ipf_lifts_raw.csv")

  1. Make a boxplot, facetted by sex.

ipf_lifts_raw %>%
  drop_na(age_class) %>%
  ggplot(aes(y = age_class, x = total_lifted_kg)) +
  geom_boxplot() +
  facet_wrap(vars(sex), ncol = 2) +
  scale_y_discrete(limits = rev)

  2. Plot means and error bars showing 95% confidence intervals using stat_summary(), also facetted by sex.

Do these plots tell a different story?

ipf_lifts_raw %>%
  drop_na(age_class) %>%
  ggplot(aes(y = age_class, x = total_lifted_kg)) +
  # fun.data = "mean_cl_normal" relies on the Hmisc package being installed
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", geom = "point") +
  facet_wrap(vars(sex), ncol = 2) +
  scale_y_discrete(limits = rev) +
  labs(caption = "Error bars show 95% confidence intervals for the mean.")
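To see what these error bars represent, it can help to compute one interval by hand. As far as we are aware, mean_cl_normal gives a t-based interval: the mean plus or minus a t quantile times the standard error. The sketch below uses simulated numbers (hypothetical, not the IPF data) and base R only.

```r
# Simulated lift totals (hypothetical values, not the IPF data)
set.seed(1)
totals <- rnorm(50, mean = 400, sd = 80)

n      <- length(totals)
se     <- sd(totals) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value

# 95% confidence interval for the mean
ci <- mean(totals) + c(-1, 1) * t_crit * se
ci

# The same interval from base R's one-sample t-test
t.test(totals)$conf.int
```

Because the standard error divides by the square root of n, groups with many lifters get very narrow intervals, even when the individual totals are widely spread. A boxplot, by contrast, describes the spread of the individual totals.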

(d) Extension: Add annotations to the ice core plot

The code below makes a plot for a subset of the ice core data you saw in exercise 1.3.

Modify it to include direct annotations of the three ice cores (DSS, DE08, DE08-2) instead of a legend.

Hint: it is probably easiest to create a data frame for the annotations, using tribble().

icecore <- read_csv("icecore.csv")

Manually specifying locations for text:

icecore_text <- tribble(
  ~core,    ~air_age_AD, ~CO2_ppm, ~hjust, ~vjust,
  "DE08-2", 1960,        330,      1,      0.5,
  "DE08",   1860,        295,      1,      0.5,
  "DSS",    1200,        290,      0.5,    1
)
icecore %>%
  filter(core != "Vostok") %>%
  ggplot(aes(x = air_age_AD, y = CO2_ppm, colour = core)) +
  geom_point() +
  geom_text(data = icecore_text, 
            aes(label = core, hjust = hjust, vjust = vjust)) +
  labs(x = "Air age (year A.D.)",
       y = "CO2 concentration (ppm)") +
  theme(legend.position = "none")

Calculating locations for text using code:

icecore_text <- icecore %>%
  filter(core != "Vostok") %>%
  group_by(core) %>%
  summarise(air_age_AD = max(air_age_AD),
            CO2_ppm = max(CO2_ppm)) %>%
  ungroup()
icecore %>%
  filter(core != "Vostok") %>%
  ggplot(aes(x = air_age_AD, y = CO2_ppm, colour = core)) +
  geom_point() +
  geom_text(data = icecore_text, 
            aes(label = core),
            hjust = 1, vjust = 1, nudge_x = -30, nudge_y = -2) +
  labs(x = "Air age (year A.D.)",
       y = "CO2 concentration (ppm)") +
  theme(legend.position = "none")

(e) Extension exercises

The questions below relate to the powerlifting data, continuing on from question (c).

Extension: Make another plot showing means plus or minus two standard deviations. Is this closer to the boxplot or closer to the 95% confidence intervals? (Roughly what percentage of a normal distribution would you expect to be within two standard deviations of the mean?)

ipf_lifts_raw %>%
  drop_na(age_class) %>%
  ggplot(aes(y = age_class, x = total_lifted_kg)) +
  stat_summary(fun.min = ~ mean(.) - 2 * sd(.), 
               fun.max = ~ mean(.) + 2 * sd(.),
               geom = "errorbar", width = 0.5) +
  stat_summary(fun = ~ mean(.), geom = "point") +
  facet_wrap(vars(sex), ncol = 2) +
  scale_y_discrete(limits = rev) +
  labs(caption = "Error bars show the mean plus or minus two standard deviations.")
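For the question in brackets, the proportion of a normal distribution within two standard deviations of the mean can be checked directly with pnorm(). This is a small base R sketch; the object names are our own.

```r
# Proportion of a normal distribution within 2 standard deviations of the mean
within_2sd <- pnorm(2) - pnorm(-2)
within_2sd

# For comparison, 1.96 standard deviations covers almost exactly 95%
within_196 <- pnorm(1.96) - pnorm(-1.96)
within_196
```

So mean plus or minus two standard deviations covers roughly 95% of individual values if the data are close to normal. That is a statement about the spread of individual lifters, not about the uncertainty in the mean.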

Extension: Make one of these plots for weight_class_kg instead of age_class. What went wrong here? (It’s okay if you can’t fix it yet!) What do you think is happening with the weight classes for men?

ipf_lifts_raw %>%
  drop_na(age_class) %>%
  ggplot(aes(y = fct_inseq(weight_class_kg), x = total_lifted_kg)) +
  geom_boxplot() +
  facet_wrap(vars(sex), ncol = 2, scales = "free_y") +
  scale_y_discrete(limits = rev)
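The use of fct_inseq() above suggests one likely culprit: if the weight classes are stored as text, they are ordered alphabetically rather than numerically. The sketch below illustrates this in base R with made-up class labels (an assumption, not taken from the data), and builds the numeric ordering by hand in the way we would expect fct_inseq() to.

```r
# Made-up weight class labels stored as text (not taken from the IPF file)
classes <- c("59", "120", "66", "105", "83")

# Default factor levels are sorted as text, so "105" comes before "59"
f_text <- factor(classes)
levels(f_text)

# Ordering the levels by their numeric value instead
numeric_order <- as.character(sort(as.numeric(unique(classes))))
f_numeric <- factor(classes, levels = numeric_order)
levels(f_numeric)
```

Labels that are not purely numeric would need extra care, which is beyond this sketch.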


© 2021 Statistical Consulting Centre, The University of Melbourne.